How to Evaluate and Raise the Quality in a Collaborative Lexicographic Approach

Dan Cristea (1,2), Corina Forăscu (1,3), Marius Răschip (1), Michael Zock (4)

(1) "Alexandru Ioan Cuza" University of Iași
(2) Romanian Academy, Iași
(3) Romanian Academy, Bucharest
(4) Université de Marseille à Luminy

E-mail: {dcristea,mraschip,corinfor}@info.uaic.ro, Michael.Zock@lif.univ-mrs.fr

Abstract

This paper focuses on different aspects of the collaborative work used to create the electronic version of a paper dictionary, edited and printed by the Romanian Academy during the last century. In order to ensure accuracy in a reasonable amount of time, collaborative proofreading of the scanned material through an on-line interface has been initiated. The paper details the activities and the heuristics used to maximize accuracy and to evaluate the work of anonymous contributors with diverse backgrounds. Observing the behaviour of the enterprise over a period of 6 months allows us to estimate the feasibility of the approach until the end of the project.

1 Introduction

The Web nowadays offers many digital dictionaries [1]. We could even say that dictionaries are going Web. This change of medium has also changed the way content is contributed. Since humans speak languages, why not ask them to fill the slots of the dictionary's microstructure? Accepting the idea that ordinary people could contribute to building a dictionary on the basis of their knowledge of a certain language implies a dramatic change of view concerning the way dictionary content is provided. Terms like "crowdsourcing" [2] and "digital sharecropping" (Zimmer, 2007) have been coined and are starting to be used in linguistics as well. Recently, there has been a rise of interest in acquiring lexicographic data by using free, large-scale work over the Web [3]. The reasons for this interest are probably economic: production costs need to be reduced. The method is not uncontroversial, because a collaborative approach might be dangerous (almost malpractice), as incompetent contributors may produce a lot of noise that is hard to get rid of. We will discuss here how to assure quality while thousands of unknown people provide data, how to motivate them, and how to evaluate their work.

The practical setting of our approach is the conversion of the paper version of a very large dictionary into electronic format: the Thesaurus Dictionary of the Romanian Language, an explanatory dictionary built under the auspices of the Romanian Academy since 1913 (once finished, in 2008, it will include 33 volumes, more than 15,000 pages, about 175,000 entries and more than 1,300,000 examples). The dictionary was created in the traditional pencil-and-paper way. It includes an index of more than 2,500 volumes of Romanian literature. eDTLR is the name of the digital form of the Dictionary. It includes the sources in digital form and the software to access them.

This paper is organized as follows. Section 2 presents the two parts of the Thesaurus Dictionary of the Romanian Language, as well as the objectives of the eDTLR project. Section 3 presents three important issues in the collaborative approach to building eDTLR: how to involve many contributors, how to obtain accuracy, and how to evaluate their work. Section 4 presents the current state of development of eDTLR and the first results and, based on them, outlines the near future. The last section gives conclusions.

2 DA, DLR and eDTLR

Building the content of the Thesaurus Dictionary of the Romanian Language took almost one century. The old series, known as the Dictionary of the Academy (DA), included 5 volumes with 3,146 pages and 44,890 entries, and was developed between 1913 and 1947 by the Romanian Academy. After an interruption, the work was restarted in the middle of the 7th decade of the last century with the new series, known as the Dictionary of Romanian Language (DLR). It is expected that the dictionary will be finalised at the end of 2007. In all, DA and DLR will have 33 volumes, more than 15,000 pages, about 175,000 entries and more than 1,300,000 examples. The dictionary was created in the traditional pencil-and-paper way, including the index of more than 2,500 volumes of written Romanian literature, until the nineties, when the lexicographers started to use computers for editing and publication.

eDTLR ("e" stands for electronic versions) is the name of the digital form of DA+DLR (1965), including its sources in digital form and the software to access them, as well as the name of a three-year project. The project focuses on three main activities: transposing the two parts of the dictionary, as well as its sources, into digital format; correcting the digital format of eDA and eDLR; and building a suite of software programs which will offer browsing capabilities, including direct access from the dictionary examples to the pages of the original sources. This means that, besides all the browsing capabilities usual in electronic dictionaries, the user will also be able to click on an example and obtain a view of the segment of the page in the original document from which the example was extracted (analogous to Google Books [4]).

At this phase there is no intention to make the two parts of the digital dictionary uniform, built as they were very distantly in time, nor to correct and fill the gaps in DA, which is supposed to reflect the changes in the modern language with less accuracy (all entries belonging to letters which were left unchanged for more than 50 years). It is hoped that the process of updating the old parts of the dictionary will be made a lot easier by the existence of the electronic version.

There are many ways in which eDTLR, as a large dictionary built over two distinct periods in time, could be taken as a creative example for developing lexicographic resources of this type in general, not only for Romanian. For instance, the vast collection of texts/attestations used to exemplify words and senses in the newer series of DLR (approximately 1,300,000 examples, representing about 88% of the whole text) can be used as a source for updating the articles of the entries belonging to the old series of the dictionary, which, as said, no longer reflect the modern language.

Then, we see eDTLR as opening the only thinkable way towards a continuous process of keeping the dictionary thesaurus of a language up to date, in the rhythm in which the language, a living entity, receives and accommodates new terms and senses, and forgets (and marks as such) obsolete terms and senses. Indeed, society is advancing towards a state in which most, if not all, of the textual resources of a language will have an electronic copy [5]. Lemmatisation, part-of-speech tagging and sense detection procedures have already become common components of today's language technology. So, one can foresee a moment when the technology and the computing power will reach a level which will permit continuous processing of the huge collection of written materials appearing in one language. The least output that can be imagined from such a linguistic processing power plant is the discovery of new words and senses that have entered the language, to be forwarded to the lexicographers for validation and inclusion in the continuously updated electronic dictionary of that language. Likewise, a total lack or very sparse mention of a word or of a sense of a word for some time could signal that the word/sense has become obsolete and, again, this has to be considered by the lexicographers.

Moreover, the benefits of such a large digital dictionary extend to the computational morphology of the language (for the exhaustive completion of the computational morphology in both analysis and generation), as well as to the continuous enhancement of statistical language models. In computational semantics, such a dictionary, due to its richness in sentences exemplifying word senses, fills a tremendous need for a sense-annotated corpus, to be used for training a word sense disambiguation program, with applications in machine translation, information extraction, and automatic detection of the semantic roles of verbs and of nouns derived from verbs.

The dictionary can be published more cheaply by electronic means, while also providing sophisticated indexes between word occurrences, including links to occurrences outside the dictionary itself, in other linguistic thesauri or in other languages.

3 The Collaborative Approach in eDTLR

In order to reduce the expected proofreading time required of professional lexicographers, we designed, implemented and advertised a Web portal [6] with an editing window dedicated to corrections. As the Academy imposed restrictions on the dissemination of preliminary versions of the dictionary, for reasons of prestige and intellectual property rights (IPR), we had to find a compromise between our need for accuracy and the prospect of involving a large community of volunteer proofreaders. The solution was to allow users to access only small extracts [7] during editing. The text displayed is under the limit of IPR reproduction, and is assigned randomly every time the user asks for a new segment. When the processed extract is saved, it is integrated again into the whole. This strategy prevents users from re-assembling large portions of the dictionary. As the total number of extracts reaches nearly 140,000, a rough estimate of the probability of obtaining a given page from 12 extracts is of the order of 10^-55.

In order to motivate people, we decided to 'reward' contributors on the basis of the quantity and quality of their work. The 'reward' consists in advertising the best ranked volunteers and, eventually, in providing access to (parts of) the final product once the project has reached its end. The problem that remains is how to evaluate the quantity and quality of the work, and how to raise the level of accuracy.

3.1 How to Raise People's Interest?

Collaborative projects like Linux and Wikipedia have always attracted many contributors because of the inherent intellectual challenge they pose to the volunteers. In the field of collaborative computational lexicography, the experience has sometimes been promising, as in the case of Wiktionary, but has sometimes met with relatively little feedback from the public, as in the case of projects like Papillon and Kamusi.

The main type of entry in an encyclopaedia is the article describing notions, concepts, facts or events. The situation is different in the case of dictionaries. A dictionary is basically a set of data (definition, translation, grammatical information, related word) associated with a headword. The articles in an encyclopaedia give the author a great deal of freedom with respect to what s/he would like to focus on, what to include, at what level of detail, etc. Hence, the writer has a lot of liberty, which is not the case for the contributors to a dictionary, where the type of information to be contributed is decided beforehand by lexicographers. A dictionary entry is very rigid in terms of format and content and, usually, there are great academic debates on which words and which variants to include concerning the various senses, what a definition should look like, which specific examples to include (especially in the case of monolingual dictionaries), etc. Of course, all this looks more like a Procrustean bed than a creative activity likely to motivate people.

The people of the Papillon project were painfully aware of this bottleneck. In order to solve it, a proposal has been made to convert the dictionary into a drill tutor or exercise generator, that is, a goal-driven, template-based sentence generator. The idea was to motivate people to contribute to the database by generating sentences based on their contributions (sentence patterns). Unfortunately, it is still premature to evaluate the heuristic value of this solution, as the tool is still under development (Zock and Afantenos, 2007).

The specific framework in which we use volunteer work in eDTLR makes the whole enterprise even more dangerous (more prone to rejection). The only involvement of the user in our case is to improve the quality of (i.e. correct) the output obtained from the OCR process, which is not really a very creative job. The corrector is supposed to spot and correct errors in the text in an editing window, provided on the right-hand side of the screen, by comparing its content with the graphical image of the same segment of text, given on the left-hand side of the screen (see Figure 1). So, how can we raise people's interest in this limited and not very enticing task?

Figure 1. Screenshot of the on-line corrector interface dedicated to novice users

The most important lever remains personal motivation, which we hope to raise in a large number of people, to contribute to a project of tremendous importance for the Romanian language. Calls have been spread over a diversity of channels, including mass media and the Internet, but also on the occasion of different academic events. University professors have warmly embraced our objectives, starting to disseminate the eDTLR objectives and to persuade their students to contribute. Moreover, the project consortium decided to stimulate contributors, based on an evaluation of the quantity and quality of their work. The stimulus consists in advertising the best ranked contributors and, eventually, in providing access to (parts of) the final product once it is finalized [8]. The remaining problem is to recognise who deserves this distinction, and therefore to evaluate the quantity and quality of the work. The task is not simple, because counting only the number of sequences sent by each participant could encourage bad practice, for instance clicking the Save button without making any correction, or typing blindly (therefore destroying the material rather than improving it). We think that an interface which gives feedback to the user in a sensible and correct way can contribute substantially to raising the quality: by producing encouraging or thankful messages in cases of good practice and warnings, although expressed in gentle phrases, in cases of bad quality, or even by totally blocking access in cases of intentionally malicious interventions. This presupposes the ability to assess, as sensitively as possible, the quality of the volunteered work. This issue is discussed in the following section.

3.2 How to Evaluate the Contributed Work?

We use manual and automatic procedures to evaluate the contributors' work. The manual evaluation is done by expert lexicographers during a second round of proofreading (whose main aim remains quality enhancement). The portal allows the expert, recognised during login, to "follow basically the traces" of the same volunteer proofreader. By receiving the same sequence of screens as the novice, the expert's view of the anonymous contributor can stabilise and, once s/he has formed an opinion of the contributor, s/he will have to rank the amateur contributor on a scale (ranging from excellent to malign) by using a facility offered by the interface.

Next, the contributor's record also includes an automatically computed evaluation based on the following criteria and heuristics:
- if a screen is saved without any keystroke during the editing phase, this is probably due to a lack of attention and the screen should be ignored. As a result, the user will be demoted;
- if the edit distance between the input segment (as given by the OCR) and the output saved by the user exceeds a certain threshold [9], then the correction is unreliable and should be ignored. Again, the user is demoted;
- if there is next to no difference between the results of the expert and the prior results of a non-expert, the non-expert will be promoted.

3.3 How to Obtain Accuracy?

Unlike Wikipedia, which tolerates a certain level of noise or incorrect material, the work we describe has to comply with the rigour and strict rules of the Academy, hence it is incompatible with errors. There are several ways to deal with accuracy.

First and foremost, the whole material of the dictionary will be corrected three times in a cascaded approach, meaning that the second and third correction phases are applied to previously corrected versions and not to the original. As shown already, the first phase is performed by (anonymous) contributors, while the remaining ones are executed by expert lexicographers.

Secondly, a large source of errors, which comes from the abbreviations of the bibliographical sources, is reduced in post-OCR processing. The dictionary has approximately 1.3 million citations, included to illustrate the use of word senses. For each citation a bibliographical reference is given. Having the full list of bibliographical references allowed us to identify the majority of them in the text using approximate string matching [10] and confidence values from the OCR tool.

Third, the interface has an ergonomic design which, among other things, allows zooming into different zones of the scanned image, in order to bring closer to the corrector's eyes portions otherwise difficult to read or to understand.

Fourth, the user has the possibility to mark certain zones of characters as uncertain, thus attracting the lexicographer's attention to them in the next correction phase.

4 Practicing the Collaborative Approach

For the time being we have released a prototype of the online interface, which has been advertised among students in Computer Science. During a period of 6 months, in which we issued two calls, 119 students, out of a population of 1,200, registered. Of these, 35 students completed a single extract, 49 corrected 2 to 5 extracts, 9 corrected 6 to 10, 15 corrected 11 to 30, 8 corrected 31 to 130, and 3 corrected more than 130 extracts. An exceptionally dedicated student proofread 312 screens over a period of two months. Overall, the effort amounts to 114 corrected dictionary pages contributed by less than 20% of the registered students. Although we expected more, the resulting profile corresponds to our Computer Science students, who tend to be more interested in evaluating the online interface than in helping with the proofreading. Two out of the 119 students were easily identified as ill-intentioned by simple heuristics.

The main features affecting the speed of correction were the extract size, the OCR error rate and the ergonomics of the editing interface. With an extract of 10 to 12 lines, the correction time of a screen ranges between 30 and 120 seconds, with an average of 92 seconds.

Based on these counts, we estimated the effort and the total number of people required to accomplish the whole correction task of the first phase. The total correction time is estimated at 3,577 hours, that is 447 days of 8 working hours each (140,000 segments * 92 seconds / 3,600 sec/h / 8). Hence, if the average correction effort observed during the first 6 months of the experimental setting is kept constant, the first phase will be finished in less than two years, in conformity with the plan. For the second estimate we considered a time frame of two years, arriving at 2,959 as the number of collaborators to be involved, supposed to work at the rhythm observed (if 119 people have corrected 114 pages in 6 months, then 11,836 people are needed to correct 11,339 pages also in 6 months. Hence, only 2,959 people are needed to correct 11,339 pages in 24 months).

Both estimates are optimistic. The time estimate is clearly below the interval limit of the project. Finding about 3,000 people willing to work on this task will not be easy. Bear in mind, though, that the initial setting was a community of students in Computer Science, with preoccupations rather remote from computational lexicography. Moreover, the students were selected from a single faculty of one university in a Romanian-speaking country. We expect that an invitation to contribute addressed to all categories of students over a larger territory will receive a much higher participation rate.

We are investigating the idea of using error-free text to improve the OCR success rate in a continuous way. The volumes printed since the nineties have been computer edited. As such, a small part of the dictionary content exists in electronic form free from errors and, therefore, will not need any correction. This material can be used to train OCR programs.

Another way to improve the OCR accuracy is by using an iterative process. In the current implementation, OCR processing, page splitting into extracts and randomization of user access to extracts are performed in this sequence, once for the whole text. Starting with a small number of validated extracts, that is, as issued by the expert correctors, the process could be iterated, training the OCR engine and thus reducing the error rate in the next steps. The number of extracts processed in a step will follow an exponential growth rate. For instance, at the first step, 8 pages could be chosen randomly, followed by page splitting and randomization of the obtained extracts. In the next step 16 pages will be processed, then 32, and so on. The OCR training will happen in parallel, before the users actually consume the extracts currently under processing. The training cycle will be triggered by an alarm, which is chosen by taking into consideration estimates of correction time on one side and training plus processing time on the other side. Training will not start without the experts' validation of the currently corrected extracts. Training on a very large corpus is not feasible, and training will be stopped when the accuracy no longer improves significantly. Then, the processing of the remaining text could be done in a single sequence.

A different approach is that used in the reCAPTCHA project [11].

5 Conclusion

The paper studies different aspects related to collaborative approaches dedicated to lexicography, in the context of a project aiming to build one of the biggest digital dictionaries in the world. The setting is that of acquiring accurate data after scanning and processing by an OCR tool. The aspects we focus on are of great interest in a context where the acquisition of a very large, yet extremely reliable collection of lexicographic data at affordable cost is at stake. The question is how to discourage the dissemination of unfinished or unreliable lexicographic material, while attracting a large community of volunteer contributors. There are also the problems of accuracy within a given setting and of having many people with various backgrounds working together: how to motivate the largest number of potential reviewers, why it is important to evaluate contributors, and how to do that. Our proposed set of heuristics has been partly validated in an implementation. Based on the observation of the behaviour of the system over a period of 6 months, we were able to foresee the evolution of the enterprise until its final accomplishment.

Contrary to other collaborative initiatives in lexicography, with unlimited perspectives concerning data acquisition, the scope of our work is clearly limited, as it is confined to quality checking (proofreading) of dictionary entries. We give precise estimates showing that a collaborative approach could be a success despite the fact that the job in itself is not really very enticing.

6 Acknowledgements

The research described in this paper is partially supported by the Romanian Ministry of Education, Research and Youth, contract eDTLR, no. 91 013/2007.

7 References

*** (1965). Dictionary of Romanian Language. New Series. Tome VI. Romanian Academy.
Zimmer, B. (2007). Charting the Digital Future of Dictionary Research: Prospects for Online Collaborative Lexicography, communication at DSNA, Chicago.
Zock, M., Afantenos, S. (2007). Let's get the student into the driver's seat. The Seventh International Symposium on Natural Language Processing, Pattaya, Thailand.

Footnotes

[1] There are 800 dictionaries in 160 languages at http://ling.kgw.tu-berlin.de/call/webofdic/diction4.html. See also Digital dictionaries of South Asia or ARTFL at www.lib.uchicago.edu/efts/ARTFL/projects/dicos/
[2] http://en.wikipedia.org/wiki/Crowdsourcing
[3] Examples are:
- Wiktionary (www.wiktionary.org), a multilingual collection of free dictionaries in over 150 languages;
- the Kamusi project (www.kamusiproject.org), a Swahili-English dictionary;
- the Papillon project (www.papillon-dictionary.org/Home.po), a multilingual dictionary for oriental and western languages;
- the Inuktitut Living Dictionary (www.livingdictionary.com/backgroundandhistory.jsp);
- the Online Slang Dictionary (http://onlineslangdictionary.com)
[4] See www.books.google.fr, for instance.
[5] As felt in the research programs recently initiated (see in Europe CLARIN, for instance), but also in legislative initiatives promoting the recording of all publications for research.
[6] https://consilr.info.uaic.ro/edtlr/
[7] 10 to 12 lines on each column.
[8] The consortium has not yet arrived at a consensus with respect to the channels on which eDTLR will be distributed (public on the Internet, access based on subscription to different functionalities, DVD, etc.). The promises that we now make to our volunteer contributors refer to the case in which the final product will not be offered freely and completely on the Internet.
[9] The threshold is continuously updated based on extracts verified by experts for each dictionary volume.
[10] http://www.dcc.uchile.cl/~gnavarro/software
[11] http://recaptcha.net/
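
The automatic evaluation heuristics listed in Section 3.2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the portal's actual implementation: the `Contributor` record, its integer `score`, the function names, the normalisation of the edit distance by the length of the OCR text, and the values `max_ratio=0.3` and `tolerance=0.01` are all hypothetical. The paper only states that a threshold exists and that it is updated from expert-verified extracts for each volume.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


@dataclass
class Contributor:
    name: str
    score: int = 0  # hypothetical ranking counter, not from the paper


def evaluate_save(user: Contributor, ocr_text: str, saved_text: str,
                  keystrokes: int, max_ratio: float = 0.3) -> bool:
    """Apply the first two heuristics; return True if the correction is kept."""
    if keystrokes == 0:
        # Saved without any keystroke: ignore the screen, demote the user.
        user.score -= 1
        return False
    # Assumption: the threshold is expressed relative to the OCR text length.
    ratio = edit_distance(ocr_text, saved_text) / max(len(ocr_text), 1)
    if ratio > max_ratio:
        # Too far from the OCR input: unreliable, ignore and demote.
        user.score -= 1
        return False
    return True


def compare_with_expert(user: Contributor, saved_text: str,
                        expert_text: str, tolerance: float = 0.01) -> bool:
    """Third heuristic: promote when the expert changed next to nothing."""
    ratio = edit_distance(saved_text, expert_text) / max(len(expert_text), 1)
    if ratio <= tolerance:
        user.score += 1
        return True
    return False
```

In such a sketch, `evaluate_save` would run when a volunteer saves an extract, and `compare_with_expert` later, once an expert has re-read the same extract in the second correction phase. Per-volume thresholds, as described in footnote 9, could be passed in through `max_ratio`.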